ABSTRACT


The effectiveness for program development of the M.I.T. Compati- 
ble Time-Sharing System was compared with that of the IBM IBSYS 
batch-processing system by means of a statistically designed ex- 
periment. An identical set of four programming problems was as- 
signed to each of a group of four programming subjects. Influences 
external to the systems such as the sequence of problem solution, 
and programmer and problem characteristics were specified as de- 
sign factors in the experiment. Data was obtained for six variables 
(e.g. programmer time, computer time, elapsed time, etc.) which 
were considered to be definitive of "system effectiveness", and 
analysis of variance techniques were employed to estimate system 
differences in these variables after differences due to the design 
factors had been eliminated. Statistical analysis of the experi- 
mental results provided strong evidence of important system dif- 
ferences, as well as a critique of the experimental design itself 
with implications for further experimentation.


I. INTRODUCTION


Inasmuch as the multiplicity of operational time-shared comput- 
ing systems has long since dispelled any doubts of their feasibility, 
time-sharing research is now largely centered upon development of 
techniques for increasing the utility and effectiveness of such systems. 
Concomitant with this developmental research effort is an evaluative 
problem of obtaining measures of the effectiveness of such systems 
in the problem solving context. The task is hampered not only by 
the inherent vagueness and ambiguity of the intuitive notion of
effectiveness, but also by the apparent lack of a readily available
technique for its measurement. What is required is not merely a
measure of the efficiency of the programming system per se, but
also a measure of the effectiveness of the total man-machine
interaction. In order to achieve this result, observations and
measurements must include the performance and behavior of the 
individual user and the conditions of his activity. 

Moreover, time-sharing lends itself to the facilitation of a 
variety of computer applications, some of which require so high a 
degree of interaction with a large scale computer as to render them 
infeasible except under time-sharing. In order to obtain an effect- 
ive measure it is therefore necessary to limit the context of in- 
quiry and to focus upon a specifiable application for the evaluation. 
Clearly the most general application, and one that must be accomplished
efficiently, is that of program development, and an initial investigation
is appropriately restricted to this context. But there remains the
need also for specification of a standard of comparison, the obvious 
choice being batch-processing. 

The search for a resolution to this problem led to the 
choice of an operational definition of "effectiveness" in terms of 
various proposed measurements. Such measurements will obviously 
be influenced by effects external to the systems such as programmer 
aptitude, learning and problem characteristics. A statistically de- 
signed experiment was therefore constructed in order to isolate 
these effects and thereby provide meaningful comparisons of the 
relative effectiveness of a time-sharing system with that of a more 
customary system of batch-processing. Successful isolation of ex- 
ternal effects can reduce statistical variability sufficiently to permit 
attainment of a given level of precision with a much smaller sample 
than would have been required otherwise. 

It can be useful and informative to carry on this type of 
investigation even with currently operating specially designed, 
time-sharing systems that bear a cost penalty for supplementary 
special equipment needed to adapt, for time-sharing, processors 
designed for sequential batch-processing. Such evaluations can 
provide guidance for later application of similar experimental tech- 
niques to more advanced systems, which might be expected to ex- 
hibit better cost-performance characteristics than the specially 
adapted processor used in this study. Moreover, the utility of the
system, in terms of the facilities provided for the programmer, need
in no way be diminished by implementation in current equipment; ...
"the essence of a useful time-sharing system lies in the program-
ming, i.e., in the software, and not in the hardware." [1]

A controlled experiment was conducted in the late summer 
of 1965, using a typical batch-processing scientific computing 
system (IBM 7094-2 IBSYS) and a flexible time-sharing system pro-
viding production applications (the M.I.T. Compatible Time-Sharing
System for the IBM 7094-1). Four programming subjects were
selected from technically trained undergraduate students with high 
programming aptitude. Each individual was assigned an identical 
set of four problems, two to be coded under time-sharing and two 
under batch-processing. The four assigned problems were typical 
of library or system subroutines involving development, implementa-
tion, and testing of algorithms.¹ All subjects had some prior pro-
gramming experience and received a review of IBM 7094 batch-pro-
cessing techniques, a brief orientation on usage of the IBM 1050
console with the Compatible Time-Sharing System (CTSS) and a
summary of the command language for that system.

1. Two of the problems were largely numerical, one involving
Monte Carlo integration and another, algebraic sorting. The
other two problems were essentially of a logical nature, one
of them an English to Pig Latin translator and the other a text
format conversion.


II. EXPERIMENTAL DESIGN


Comparisons of system effectiveness for any two computing
systems are complicated by numerous factors the effects of which
are difficult to identify and to measure. To start with, the definition
of "system effectiveness" is itself open to debate, so that a number
of possible measurements relating to this loosely defined concept
have been considered, viz:



Elapsed time - total working days from start to
completion of each problem.

Analysis time - total time in minutes spent by each
programmer in programming, analysis and debugging
of each problem.

Programmer's time - total time in minutes spent by
each programmer on each problem. This includes
analysis time plus such items as keypunching and
console time.

Computer time - total computer time in minutes for
each problem.

Number of compilations - number of attempted com-
pilations for each problem solution.

Total cost² - cost in dollars, for programmer and
equipment times, required for each problem solution.

2. Cost estimates in the experiment were based upon somewhat
idealized systems which included in both the batch and time-
sharing operation only that equipment required to provide the
level of service afforded to the programming subjects during
the experiment, and omitted any actual equipment that served
only a highly specialized or experimental function. Cost data
was derived from computer rental and programmer salary
estimates; overhead costs were disregarded for both systems.

Having settled upon these measures or response variables
as useful indicators for comparing the two computational techniques,
it is essential to try to eliminate the effects of external factors the
influence of which might be so great as to obscure the comparisons
of primary interest, namely those pertaining to the various proposed
measures of system effectiveness.

In considering the types of measures employed in this study, 
such as computer and programmer time expenditures, it is immedi- 
ately apparent that these can be directly influenced by differences in 
individual programmers and the particular choice of problems. Ad-
ditionally, since there could be a learning effect from the programming
of one problem to the next, the order of problem handling within each
system might also be a relevant factor. In order to estimate the
system effect differences, independently of differences in the afore-
mentioned factors (i.e., individuals, problems, and order), a modi-
fied Graeco-Latin Square design³ was adopted. The layout of this
particular design is shown in Table II-1.

Examination of the design reveals that each programmer 
coded the same set of four problems, two under time-sharing and 
two under batch-processing. Furthermore, each problem was coded 
twice under each system and the sequential order of problem hand- 
ling by each programmer was different, so that each problem was 
the first coded by one programmer, the second coded by another,
etc. Each problem was completed before the next was begun.

3. See [2] for discussion of Graeco-Latin Squares.


TABLE II-1

Experimental Design

                         Problems
Programmers       1       2       3       4

     1            T2      B1      T3      B4
     2            B1      T2      B4      T3
     3            B4      T3      B1      T2
     4            T3      B4      T2      B1


NOTATION: B denotes batch-processing
          T denotes time-sharing

The subscripts (written here as trailing digits) denote the
sequence of problem handling for each programmer.


It should be noted that the design is orthogonal with respect 
to the main effects, which are assumed to be additive. Thus, the 
design permits independent estimation of all the main effects (i.e., 
effects due to differences in systems, programmers, problems
and order of problem solution within system).
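
The balance properties claimed for Table II-1 can be checked mechanically. The following is a minimal sketch, assuming the layout as transcribed above; the names `DESIGN` and `check_balance` are illustrative and do not come from the original study.

```python
# Table II-1 as transcribed: rows are programmers, columns are problems 1-4.
# Each cell is (system, position in that programmer's sequence).
DESIGN = {
    1: [("T", 2), ("B", 1), ("T", 3), ("B", 4)],
    2: [("B", 1), ("T", 2), ("B", 4), ("T", 3)],
    3: [("B", 4), ("T", 3), ("B", 1), ("T", 2)],
    4: [("T", 3), ("B", 4), ("T", 2), ("B", 1)],
}


def check_balance(design):
    """Return True if every programmer (row) and every problem (column)
    sees each system twice and each sequence position exactly once."""
    rows = list(design.values())
    cols = list(zip(*rows))
    for line in rows + cols:
        systems = sorted(system for system, _ in line)
        orders = sorted(order for _, order in line)
        if systems != ["B", "B", "T", "T"] or orders != [1, 2, 3, 4]:
            return False
    return True


balanced = check_balance(DESIGN)
print(balanced)
```

A layout that fails this check would confound at least one pair of main effects, so the independent estimation described above would no longer hold.
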
III. ANALYSIS AND INTERPRETATION OF RESULTS


The observations obtained from the experiment are shown 
in Table III-1. Summaries of these results for each of the design 
factors are given in Table III-2, together with the observed signifi-
cance levels as calculated in the analysis of variance.⁴ An observed
significance level is the probability of observing an F value as large
as or larger than the one computed if there is no difference in the re-
sponse variable with regard to the design factor. Thus, a very small
observed significance level would cast doubt upon the hypothesis of no
difference in the response variable due to the particular design factor;
for example, referring to Table III-2, the observed difference in
programmer's time for the two systems (i.e., 5672 minutes for time-
sharing vs. 2737 minutes for batch) may be considered indicative of
a basic difference in the systems, since the observed significance level
is only .019.
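
The F ratio behind the .019 figure can be re-derived from the programmer-time entries of Table III-1. The sketch below assumes an additive main-effects model with order nested within system, which leaves 6 degrees of freedom for error; that model is an inference from the design description, not a procedure spelled out in the text, and the names `OBS` and `between_ss` are illustrative. The fourth field is the order within the system (first or second problem handled under that system), not the overall sequence position.

```python
from collections import defaultdict

# Programmer time in minutes, transcribed from Table III-1.
# Fields: programmer, problem, system, order within system, minutes.
OBS = [
    (1, 1, "T", 1, 810), (1, 2, "B", 1, 355), (1, 3, "T", 2, 1250), (1, 4, "B", 2, 533),
    (2, 1, "B", 1, 195), (2, 2, "T", 1, 75), (2, 3, "B", 2, 115), (2, 4, "T", 2, 355),
    (3, 1, "B", 2, 340), (3, 2, "T", 2, 145), (3, 3, "B", 1, 537), (3, 4, "T", 1, 890),
    (4, 1, "T", 2, 1369), (4, 2, "B", 2, 110), (4, 3, "T", 1, 778), (4, 4, "B", 1, 552),
]


def between_ss(key, data, correction):
    """Sum of squares between the groups defined by key(row)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in data:
        totals[key(row)] += row[4]
        counts[key(row)] += 1
    return sum(t * t / counts[k] for k, t in totals.items()) - correction


grand = sum(row[4] for row in OBS)
cf = grand ** 2 / len(OBS)                      # correction for the mean
ss_total = sum(row[4] ** 2 for row in OBS) - cf
ss_system = between_ss(lambda r: r[2], OBS, cf)
# Order is nested within system: subtract the system part from the
# sum of squares between the four (system, order) cells.
ss_order = between_ss(lambda r: (r[2], r[3]), OBS, cf) - ss_system
ss_programmer = between_ss(lambda r: r[0], OBS, cf)
ss_problem = between_ss(lambda r: r[1], OBS, cf)

ss_error = ss_total - ss_system - ss_order - ss_programmer - ss_problem
df_error = (len(OBS) - 1) - 1 - 2 - 3 - 3       # 6 under the assumed model
ms_error = ss_error / df_error

f_system = ss_system / ms_error                 # 1 degree of freedom
f_problem = (ss_problem / 3) / ms_error         # 3 degrees of freedom
print(round(f_system, 2), round(f_problem, 2))
```

Under these assumptions the ratios for systems and for problems agree with the 10.04 and 4.30 listed for programmer time in Table III-2.
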

4. See [2] for discussion of the analysis of variance.


TABLE III-1

Experimental Results

                                                  Problems
Programmer   Measure                      1          2          3          4

    1        System and sequence          T2         B1         T3         B4
             Elapsed Time (Days)          6.5        4.0        10.0       9.0
             Analysis Time (Min.)         450        295        915        420
             Programmer Time (Min.)       810        355        1250       533
             Computer Time (Min.)         23.3       9.6        21.9       13.5
             No. of Compilations          38         4          26         8
             Total Cost (Dollars)         368.52     107.35     370.81     152.65

    2        System and sequence          B1         T2         B4         T3
             Elapsed Time (Days)          4.0        0.5        8.0        3.0
             Analysis Time (Min.)         152        60         95         300
             Programmer Time (Min.)       195        75         115        355
             Computer Time (Min.)         16.2       0.7        10.8       3.6
             No. of Compilations          6          1          6          4
             Total Cost (Dollars)         160.94     13.60      106.55     68.43

    3        System and sequence          B4         T3         B1         T2
             Elapsed Time (Days)          3.0        0.5        5.0        3.0
             Analysis Time (Min.)         310        60         486        550
             Programmer Time (Min.)       340        145        537        890
             Computer Time (Min.)         6.0        3.0        25.2       12.1
             No. of Compilations          3          2          7          20
             Total Cost (Dollars)         73.00      49.48      262.04     214.84

    4        System and sequence          T3         B4         T2         B1
             Elapsed Time (Days)          4.0        5.0        2.0        8.0
             Analysis Time (Min.)         563        95         161        442
             Programmer Time (Min.)       1369       110        778        552
             Computer Time (Min.)         13.7       8.4        13.7       10.8
             No. of Compilations          13         4          14         11
             Total Cost (Dollars)         261.32     83.90      231.77     128.40

Ti denotes that the programmer's ith problem was handled
under time-sharing.

Bj denotes that the programmer's jth problem was handled
under batch-processing.


TABLE III-2

Summary of Experimental Results by Design Factor

α = observed significance level

1. Systems

                         Time-sharing    Batch        F        α
Elapsed Time             29.5            46           4.40     .081
Analysis Time            3059            2295         1.08     >.200
Programmer Time          5672            2737         10.04    .019
Computer Time            92              100.5        <1       >.200
No. of Compilations      118             49           6.23     .047
Total Cost               1578.77         1074.83      4.34     .082

2. Order Within Systems

                         Time-sharing                                Batch
                         1st       2nd       F            α          1st       2nd       F       α
Elapsed Time             12.0      17.5      [illegible]  >.200      21.0      25.0      <1      >.200
Analysis Time            1221      1838      [illegible]  >.200      1375      920       <1      >.200
Programmer Time          2553      3119      [illegible]  >.200      1639      1098      <1      >.200
Computer Time            49.8      42.2      [illegible]  >.200      61.8      38.7      2.66    .154
No. of Compilations      73        45        [illegible]  >.200      28        21        <1      >.200
Total Cost               828.73    750.04    [illegible]  >.200      658.73    416.10    2.01    >.200

3. Programmers

                         1          2          3          4          F        α
Elapsed Time             29.5       15.5       11.5       19.0       3.85     .075
Analysis Time            2080       607        1406       1261       2.70     .139
Programmer Time          2948       740        1912       2809       4.89     .047
Computer Time            68.3       31.3       46.3       46.6       2.31     .176
No. of Compilations      76         17         32         42         3.28     .101
Total Cost               999.33     349.52     599.36     705.39     4.95     .046

4. Problems

                         1          2          3          4          F        α
Elapsed Time             17.5       10.0       25.0       23.0       2.91     .123
Analysis Time            1475       510        1657       1712       2.33     .174
Programmer Time          2714       685        2680       2330       4.30     .061
Computer Time            59.2       21.7       71.6       40.0       4.78     .050
No. of Compilations      60         11         53         43         2.46     .160
Total Cost               863.78     254.33     971.17     564.32     7.10     .021


Similarly, the number of attempted compilations (118 for
time-sharing vs. 49 for batch) appears to be significantly different
for the two systems at the .047 observed significance level. Addi-
tionally, somewhat higher significance levels of .08, correspond-
ing to system differences in elapsed time (50% higher for batch-
processing) and total cost (50% higher for time-sharing), were ob-
served. It did not appear that there were any significant system
differences with respect to computer time or analysis time. It
should be noted that the experiment was designed in such a way that
comparisons of these two systems are independent of any effects
which might be attributable to the other design factors, namely
programmers, problems and order. As we shall see, some of these
effects were so large that the system differences might have been
disguised had the experimental design not allowed for their isolation.

Further examination of Table III-2 facilitates identification
of the other design factors which appear to effect significant differ-
ences upon one or more of the response variables, as judged by
their accompanying observed significance levels. For example, dif-
ferences in total cost (as great as 3 to 1) and programmer time (as
great as 4 to 1) among the different programmers appear to be signi-
ficant, despite the fact that all of the programmers had similar
formal technical undergraduate backgrounds, and each received an
A grade on the IBM Data Processing Aptitude Test. Table III-2
also reveals large and apparently significant differences in pro-
grammer's time, computer time and total cost due to the effect of
the different problems. The order of processing problems on each
system had no apparent effect upon any of the response variables.

As might be expected, the six response variables chosen
for this experiment are not independent, and hence the observed
significance levels for the six variables are not independent. The
interdependencies of the response variables are summarized in
Table III-3, which is the matrix of partial correlation coefficients
for the six variables after eliminating the effects due to the design
factors. For example, it is interesting to note that the correlation
coefficient between programmer time and computer time, after
eliminating the effects of differences in programmers, problems,
systems and order from both, is only .18.
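
One way to read "after eliminating the effects" is as a correlation between residuals: each response has its fitted design effects subtracted, and the two sets of residuals are then correlated. The sketch below takes that reading as an assumption (the text does not spell out the computation) and uses programmer time and computer time as transcribed from Table III-1. The labels "T:first", "B:second", etc. mark the system and the order within that system; they and the helper names are illustrative.

```python
import math
from collections import defaultdict

# Fields: programmer, problem, system-and-order cell,
# programmer time (min.), computer time (min.), from Table III-1.
OBS = [
    (1, 1, "T:first", 810, 23.3), (1, 2, "B:first", 355, 9.6),
    (1, 3, "T:second", 1250, 21.9), (1, 4, "B:second", 533, 13.5),
    (2, 1, "B:first", 195, 16.2), (2, 2, "T:first", 75, 0.7),
    (2, 3, "B:second", 115, 10.8), (2, 4, "T:second", 355, 3.6),
    (3, 1, "B:second", 340, 6.0), (3, 2, "T:second", 145, 3.0),
    (3, 3, "B:first", 537, 25.2), (3, 4, "T:first", 890, 12.1),
    (4, 1, "T:second", 1369, 13.7), (4, 2, "B:second", 110, 8.4),
    (4, 3, "T:first", 778, 13.7), (4, 4, "B:first", 552, 10.8),
]


def residuals(col):
    """Residuals of one response after removing additive design effects."""
    grand = sum(r[col] for r in OBS) / len(OBS)
    means = []
    for factor in (0, 1, 2):
        groups = defaultdict(list)
        for r in OBS:
            groups[r[factor]].append(r[col])
        means.append({k: sum(v) / len(v) for k, v in groups.items()})
    # With three mutually orthogonal factors, the fitted value is the sum
    # of the three level means minus twice the grand mean.
    return [r[col] - (means[0][r[0]] + means[1][r[1]] + means[2][r[2]] - 2 * grand)
            for r in OBS]


def correlation(x, y):
    """Correlation of two zero-mean residual vectors."""
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


r_prog_comp = correlation(residuals(3), residuals(4))
print(round(r_prog_comp, 2))
```

With the data as transcribed, the residual correlation comes out close to the .18 quoted above, which suggests that this reading matches the original calculation.
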

Among the virtues of a time-sharing system is the availa-
bility of selective console debugging techniques, and for this an ela-
borate battery of diagnostic tools has been developed. But these
techniques are truly available only to one already trained in their 
use. Our subjects, lacking such facilities, were constrained to 
employ under time-sharing the same habits that had been evolved 
effectively to cope with batch-processing operating conditions, i.e., 
desk debugging, recompilation, and repeated execution. Moreover, 
the very availability of the time-sharing console makes for the like- 
lihood of abuse in this mode of operation for there is far less con- 
straint to correct at any time a maximum of programming blunders 
when the opportunity for immediate compilation and test is always 
present. 

TABLE III-3

Partial Correlations Among Response Variables, After Eliminating the
Effect of the Design Factors

                        Elapsed   Analysis   Program-   Computer   No. of     Total
                        Time      Time       mer Time   Time       Compi-     Cost
                                                                   lations

Elapsed Time            1.00      .54        .23        .53        .83        .64
Analysis Time           .54       1.00       .80        .28        .30        .49
Programmer Time         .23       .80        1.00       .18        -.01       .38
Computer Time           .53       .28        .18        1.00       .60        .95
No. of Compilations     .83       .30        -.01       .60        1.00       .72
Total Cost              .64       .49        .38        .95        .72        1.00


TABLE III-4

Comparison of Two Systems

                                   Time-      Batch-
                                   Sharing    Processing    T/B

Computer Time                      92         100.5         .92
  (minutes)
Number of Compilations             118        49            2.41
Computer Time/Compilation          .78        2.05          .38
  (minutes)
Cost/Compilation                   13         22            .59
  (dollars)
Programmer's Time/Compilation      48         56            .86
  (minutes)


Indeed, as shown in Table III-4, the number of compila-
tions under time-sharing was more than double that experienced
under batch-processing. Thus, normal program debugging tech-
niques seem to be wasteful under time-sharing for they apparently
result in excessive compilations. However, the system efficiency
of CTSS seemed sufficient to compensate for this increase, since
the computer time per compilation under CTSS was only 38% as
great as that experienced under the batch system.
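
The per-compilation figures of Table III-4 follow directly from the system totals of Table III-2. A small sketch of that arithmetic is given below; the dictionary keys and the helper `per_compilation` are illustrative names only.

```python
# System totals from Table III-2, as (time-sharing, batch).
totals = {
    "computer_min": (92.0, 100.5),
    "compilations": (118, 49),
    "cost_dollars": (1578.77, 1074.83),
    "programmer_min": (5672, 2737),
}


def per_compilation(name):
    """Return (time-sharing, batch) values of a total divided by compilations."""
    ts, batch = totals[name]
    n_ts, n_batch = totals["compilations"]
    return ts / n_ts, batch / n_batch


for name in ("computer_min", "cost_dollars", "programmer_min"):
    ts, batch = per_compilation(name)
    print(f"{name}: T={ts:.2f}  B={batch:.2f}  T/B={ts / batch:.2f}")
```

Computed from unrounded totals, the cost ratio is about .61; the .59 in the table equals the ratio of the rounded entries 13 and 22.
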
In comparing computer time for the two systems, it should
be noted that the time-sharing system is implemented on a 2 micro-
second cycle 7094-1, while the batch-processing system is imple-
mented on a 1.4 microsecond 7094-2. Furthermore, the time-sharing
system does not utilize dynamic relocation techniques, so that
memory must be continually reconstituted for each user processing
cycle.


IV. IMPLICATIONS FOR THE DESIGN OF FUTURE EXPERIMENTS 


Scientific endeavor is essentially an iterative process involv-
ing experimentation, observation and continual re-evaluation of
hypotheses based on accumulated experience. Information acquired at
particular phases of this process provides bases for directing the
course of subsequent phases. Thus, results from this initial small-
scale experiment, limited in scope to the comparative assessment
of two particular systems, bear not only upon the measures of system
effectiveness themselves, but apply with equal validity to a critique
of the general assumptions upon which the experiment was based. A
number of useful observations can thus be made concerning the
design of future experiments of this general nature.


1. The variation attributable to problem and programmer
differences (cf. Table III-2) is of sufficient magnitude to suggest in-
clusion of these factors in the design of future experiments in order to
separate such effects from the system characteristics of interest.


2. The learning effect, as measured by the variation due to
order of processing the different problems, appeared to be negligible
in the experiment relative to the other factors being measured. One
might however anticipate that under altered circumstances (e.g. with
an enlarged sample size) the learning effect might indeed become
relevant. An alternative is then to randomize the order of problem
solution rather than to consider order as a separate factor in future
experiments. Advantages of randomization over inclusion as a design
factor are a greater flexibility in design and a larger number of
degrees of freedom for estimating the error variance.


3. A critical question in the planning of an experiment is
determination of sample size. Our experiment may appear to be of
small scale; however, under the hypothesis that there are no actual
system differences, the observed significance level reflects the actual
sample size used for the particular experimental design. If there
are indeed differences between the systems, a question arises as to
the sensitivity of the experiment for detecting such differences.
In this context questions of sample size become relevant. One possi-
ble index of sensitivity can be obtained from power curves, which
give the probabilities of observing various significance levels as
functions of variance, true difference in mean responses and sample
size.

Our initial experiment provides us with an estimate of the
variance for each response variable and enables the derivation of
power curves for them. In our experiment the response variables
elapsed time, programmer time, and computer time exhibit almost
identical sample coefficients of variation and therefore the same set
of power curves is applicable to all three. For example, Figure
IV-1 shows a set of power curves based on significance levels of .05
for these responses. The abscissa shows the mean difference in
the response between systems, expressed as a percentage of the
mean response for both systems. The ordinate indicates the power
or probability of observing significance levels as small as .05. Thus,
we see that in our experiment the probability of detecting differences
of 40% at the .05 significance level was only about .4. Increasing
the experiment to 6x6 (i.e., 6 programmers each solving 6 problems)
would double the probability of detecting differences of 40% and
would enable us to detect differences as small as 30% with prob-
ability one-half. Although enlargement of sample size increases
sensitivity, the limiting factor is usually economic in nature, so
what is sought is a trade-off involving allocation of resources and
the attainment of prescribed levels of sensitivity.

Figure IV-1. Power Curves at .05 Level for Various Sample Sizes (n)
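
The quoted power figures can be approximated by simulation. The sketch below is not the method used to draw Figure IV-1; it is a Monte Carlo stand-in built on several labeled assumptions: a coefficient of variation of about 0.43 (roughly what the error mean square for programmer time implies, but not stated in the text), 6 error degrees of freedom for the 4x4 layout and 20 for a 6x6 layout of the same structure, and the standard tabulated two-sided 5% critical values of t (2.447 and 2.086). The function name and parameters are illustrative.

```python
import math
import random


def simulated_power(diff_frac, cv, n_per_system, error_df, t_crit,
                    trials=20000, seed=1):
    """Monte Carlo estimate of the chance that the two-system F test
    (equivalently, a two-sided t test) reaches the chosen level."""
    rng = random.Random(seed)
    # Standard error of the difference between two system means,
    # expressed as a fraction of the overall mean response.
    se = cv * math.sqrt(2.0 / n_per_system)
    shift = diff_frac / se
    hits = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)
        # Chi-square draw with error_df degrees of freedom.
        chi2 = rng.gammavariate(error_df / 2.0, 2.0)
        t = (z + shift) / math.sqrt(chi2 / error_df)
        if abs(t) > t_crit:
            hits += 1
    return hits / trials


CV = 0.43  # assumed coefficient of variation, not given in the text

power_4x4 = simulated_power(0.40, CV, n_per_system=8, error_df=6, t_crit=2.447)
power_6x6 = simulated_power(0.40, CV, n_per_system=18, error_df=20, t_crit=2.086)
power_6x6_30 = simulated_power(0.30, CV, n_per_system=18, error_df=20, t_crit=2.086)
print(power_4x4, power_6x6, power_6x6_30)
```

Under these assumptions the three estimates should fall in the neighborhood of the values read from the power curves (about .4, about double that, and about one-half); the exact numbers depend on the assumed coefficient of variation.
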

The power curves depicted in Figure IV-1 thus provide a 
basis for deciding how to allocate resources to further experiments 
of this nature; they provide also a measure of how well we have
done statistically in the initial experiment.